feat: let nw.Enum accept categories, map pandas ordered categorical to Enum (only in main namespace, not stable.v1)#2192
Conversation
- add conversion from native to pandas - add conversion from native to Polars
|
thanks! it's encouraging that this doesn't break downstream tests sorry i didn't get round to it for tomorrow's release, will try to get it in for next week's one 👍 |
narwhals/_pandas_like/utils.py
Outdated
| except ImportError as exc: # pragma: no cover | ||
| msg = f"Unable to convert to {dtype} to to the following exception: {exc.msg}" | ||
| raise ImportError(msg) from exc | ||
| return pd.CategoricalDtype(categories=dtype.categories, ordered=True) |
There was a problem hiding this comment.
not sure if we can do something pandas-specific here, as this is used by cudf and modin too - could we generalise?
There was a problem hiding this comment.
I'm a little confused by this.
pandas is already a module-level import?
narwhals/narwhals/_pandas_like/utils.py
Line 13 in 5550ad8
There was a problem hiding this comment.
@dangotbanned you're right- I pulled this code from a pretty old branch I had so that must have just been leftover. I'll delete it.
@MarcoGorelli I'll look into generalizing cudf and modin
| if dtype == "category": | ||
| if native_dtype.ordered: | ||
| return dtypes.Enum(categories=native_dtype.categories) | ||
| return dtypes.Categorical() |
There was a problem hiding this comment.
this would be a breaking change, so i'm not totally sure about it - could we preserve the current behaviour in v1 and only make this change in the main namespace? the version variable is available in this function, you can use that
I noticed a new one in (#2192) and thought I'd get them all in one sweep
* chore(typing): Resolve `_polars.utils` dtype ignores I noticed a new one in (#2192) and thought I'd get them all in one sweep * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * chore: "coverage" Just replacing the original `getattr`, there was already no coverage for that https://github.com/narwhals-dev/narwhals/actions/runs/14145863466/job/39633072966?pr=2312 --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
This will be fixed in a future modin release modin-project/modin#7487
|
thanks Cam - looks like there's a xpass |
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
dangotbanned
left a comment
There was a problem hiding this comment.
Thanks for the PR @camriddell!
I've left some non-blocking comments/questions.
Looking pretty ready to me 🎉
| def non_object_native_to_narwhals_dtype(native_dtype: Any, version: Version) -> DType: | ||
| dtype = str(native_dtype) | ||
|
|
There was a problem hiding this comment.
This change seems to have been there since the first commit (3581985), but doesn't seem to be documented?
It looks like this part is related
Which would mean we do the str(...) call twice now. Just an observation, not sure if there is a cost to that
Are all non-object pandas data types guaranteed to be immutable?
I think str was used because it is hashable, so is safe to use in functools.lru_cache
There was a problem hiding this comment.
Which would mean we do the str(...) call twice now. Just an observation, not sure if there is a cost to that
I think the cost of a repeated call to str should be fairly negligible, we can always come back later to refactor if a profiler disagrees with this statement and this leads to a larger overhead.
Are all non-object pandas data types guaranteed to be immutable?
I think str was used because it is hashable, so is safe to use in functools.lru_cache
Since the tests pass, I am at least confident that all of the datatypes are hashable, however whether that hash is something meaningful or just the default id(self) / 16 then caching may not be reliable. That said, perhaps we can also leave as is for now, then if we catch wind of a slow down in the future we can revisit it? Trying to avoid the pre-mature optimization scenarios here.
There was a problem hiding this comment.
@camriddell agreed on the str part.
My concern on the hashability though is related to #2051 (comment)
Right now we won't get a warning like that because we have:
native_dtype: AnyHowever - good news!
I changed it to this locally:
@functools.lru_cache(maxsize=16)
def non_object_native_to_narwhals_dtype(
native_dtype: pd.api.extensions.ExtensionDtype, version: Version
) -> DType:And followed though to the docs to find:
ExtensionDtypes are required to be hashable. The base class provides
Looks like we're all good 🙂
There was a problem hiding this comment.
Great find on that one, thanks so much for diving in there!
| if isinstance(dtype, dtypes.Enum): | ||
| import pandas as pd | ||
|
|
||
| # NOTE: `pandas-stubs.core.dtypes.dtypes.CategoricalDtype.categories` is too narrow | ||
| # Should be one of the `ListLike*` types | ||
| # https://github.com/pandas-dev/pandas-stubs/blob/8434bde95460b996323cc8c0fea7b0a8bb00ea26/pandas-stubs/_typing.pyi#L497-L505 | ||
| return pd.CategoricalDtype(dtype.categories, ordered=True) # pyright: ignore[reportArgumentType] |
There was a problem hiding this comment.
@MarcoGorelli does this seem like it could be widened upstream (https://github.com/pandas-dev/pandas-stubs)?
I've traced back the runtime check to is_list_like:
The current annotation only permits list[Any] as a non-pandas Sequence
There was a problem hiding this comment.
@camriddell ignore this, I only meant to add as a comment - not the review 🫣
There was a problem hiding this comment.
@MarcoGorelli gentle nudge on this, in case it was missed
There was a problem hiding this comment.
hey - yeah, probably, the pandas stubs definitely don't get all the attention they probably deserve
MarcoGorelli
left a comment
There was a problem hiding this comment.
This is going to cause issues for people even just inspecting the schema of a dataframe:
In [1]: import narwhals as nw
In [2]: import pandas as pd
In [3]: s = pd.Series([1,2,3], dtype=pd.CategoricalDtype(ordered=True))
In [4]: nw.from_native(s, series_only=True)
Out[4]:
┌──────────────────────────────────┐
| Narwhals Series |
|----------------------------------|
|0 1 |
|1 2 |
|2 3 |
|dtype: category |
|Categories (3, int64): [1 < 2 < 3]|
└──────────────────────────────────┘
In [5]: nw.from_native(s, series_only=True).dtype
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[5], line 1
----> 1 nw.from_native(s, series_only=True).dtype
File ~/polars-api-compat-dev/narwhals/series.py:368, in Series.dtype(self)
353 @property
354 def dtype(self: Self) -> DType:
355 """Get the data type of the Series.
356
357 Returns:
(...) 366 Int64
367 """
--> 368 return self._compliant_series.dtype
File ~/polars-api-compat-dev/narwhals/_pandas_like/series.py:236, in PandasLikeSeries.dtype(self)
232 @property
233 def dtype(self: Self) -> DType:
234 native_dtype = self.native.dtype
235 return (
--> 236 native_to_narwhals_dtype(native_dtype, self._version, self._implementation)
237 if native_dtype != "object"
238 else object_native_to_narwhals_dtype(
239 self.native, self._version, self._implementation
240 )
241 )
File ~/polars-api-compat-dev/narwhals/_pandas_like/utils.py:321, in native_to_narwhals_dtype(native_dtype, version, implementation)
319 return arrow_native_to_narwhals_dtype(native_dtype.pyarrow_dtype, version)
320 if str_dtype != "object":
--> 321 return non_object_native_to_narwhals_dtype(native_dtype, version)
322 elif implementation is Implementation.DASK:
323 # Per conversations with their maintainers, they don't support arbitrary
324 # objects, so we can just return String.
325 dtypes = import_dtypes_module(version)
File ~/polars-api-compat-dev/narwhals/_pandas_like/utils.py:260, in non_object_native_to_narwhals_dtype(native_dtype, version)
258 return dtypes.Categorical()
259 if native_dtype.ordered:
--> 260 return dtypes.Enum(native_dtype.categories)
261 return dtypes.Categorical()
262 if (match_ := PATTERN_PD_DATETIME.match(dtype)) or (
263 match_ := PATTERN_PA_DATETIME.match(dtype)
264 ):
File ~/polars-api-compat-dev/narwhals/dtypes.py:464, in Enum.__init__(self, categories)
462 if not isinstance(cat, str):
463 msg = f"{type(self).__name__} categories must be strings; found data of type {type(cat).__name__!r}"
--> 464 raise TypeError(msg)
465 seen.add(cat)
466 self.categories = sequence
TypeError: Enum categories must be strings; found data of type 'int'In particular, it would be a breaking change for Altair users, who'd no longer be able to plot pandas dataframes where columns are of categorical dtype and have non-string categories. It's probably not showing up at the moment in the downstream tests because we were careful to use narwhals.stable.v1
That's a good point @MarcoGorelli If someone is currently doing that operation, on import pandas as pd
from narwhals.stable import v1 as nw_v1
s = pd.Series([1, 2, 3], dtype=pd.CategoricalDtype(ordered=True))
>>> nw_v1.from_native(s, series_only=True).dtype
CategoricalSo far we've had two options: I see two other tweaks we could do to option 2 - when we can't meet the constraints of
I think either of those would solve the problem, but I think the simplest is to just keep using |
|
This is the part that would break in Altair: It would also affect Plotly and other libraries I think it's fine to be laxer here - Polars only allows string column names, but we allow pandas dataframe with non-string column names. Similarly, we can allow pandas dataframes with non-string categories I think it's legit to do something like where the categories are taken from user inputs. If the user is starting with something which a backend permits, they can continue with that, no issues |
Well spotted @MarcoGorelli, I stand corrected 😄
I guess I'm just more in the camp of what @camriddell said in (#2192 (comment))
It just seems to me like we're introducing a footgun by deviating from how import pandas as pd
import polars as pl
# NOTE: Strings
>>> pl.Series(pd.Series(["1", "2", "3"], dtype=pd.CategoricalDtype(ordered=True))).to_pandas()
0 1
1 2
2 3
Name: , dtype: category
Categories (3, object): ['1', '2', '3']
# NOTE: Not strings
>>> pl.Series(pd.Series([1, 2, 3], dtype=pd.CategoricalDtype(ordered=True))).to_pandas()
0 1
1 2
2 3
Name: , dtype: int64Important Happy to follow your lead on this @MarcoGorelli, just wanna make sure I've raised my concerns 🙂 |
|
sure, thanks for explaining true, there is a risk that someone writes something which doesn't end up working for polars, but i'd rather accept that risk than disallow people from passing valid pandas dataframes to narwhals i think one reason for narwhals' relatively rapid growth has been that, relative to similar/competing projects, we've put a lot of emphasis on there not being any cost to existing pandas users |
fd5ac63 to
1e3326f
Compare
|
@MarcoGorelli I believe the current version meets the changes you requested? When you have a chance can you take another look? |
@MarcoGorelli just wanna double check this was what you asked for? remove enum duplication/null checks I thought in (#2192 (comment)) you just wanted to allow non-strings - not allow duplicates and But no objections from me |
|
thanks! the backends themselves already disallow duplicates and nulls, so tbh i'm not super-bothered, especially given that people will usually be inspecting schemas of dataframes containing enums rather than making new ones |
nw.Enum accept categories, map pandas ordered categorical to Enum (only in main namespace, not stable.v1)nw.Enum accept categories, map pandas ordered categorical to Enum (only in main namespace, not stable.v1)

What type of PR is this? (check all applicable)
Related issues
Enumtake arguments, allow it in construction #1541Checklist
If you have comments or can explain your changes, please do so below
Adds support for the
nw.Enumdatatype for pandas (backed bypandas.CategoricalDtype(…, ordered=True)The current implementation diverges from pandas/Polars in two broad ways